Abstract

Faust 0.9.9.5 introduces new compilation options that automatically
parallelize the generated code using OpenMP. This paper explains
how the automatic parallelization is done and presents
some benchmarks.
1 Introduction

Faust is a programming language for real-time signal
processing and synthesis, designed from scratch to be a
compiled language. Being efficiently compiled allows
Faust to provide a viable high-level alternative to C/C++
for developing high-performance signal processing applications,
libraries or audio plug-ins.

Until recently the computation code generated by the
compiler was organized quite traditionally as a single
sample processing loop. This scheme works very well
but it doesn’t take advantage of multicore architectures.
Moreover it can generate code that exceeds the autovectorization
capabilities of current C++ compilers.

We have recently extended the compiler with two new
schemes: the vector and the parallel schemes. The vector
scheme simplifies the autovectorization work of the
C++ compiler by splitting the sample processing loop
into several simpler loops. The parallel scheme analyzes
the dependencies between these loops and adds OpenMP
pragmas to indicate those that can be computed in parallel.

These new schemes can produce interesting performance
improvements. The goal of the paper is to present
these new compilation schemes and to provide some
benchmarks comparing their performance. The paper
is organized as follows: the next section gives a
brief overview of the Faust language, the third section
presents the three code generation schemes, and the last
section introduces the benchmarks used and the results
obtained.


2 Faust overview 

In this section we give a brief overview of Faust with
some examples of code.

A Faust program describes a signal processor, something
that transforms some input signals and produces
some output signals. The programming model used combines
a functional programming approach with a block-diagram
syntax. The functional programming approach
provides a natural framework for signal processing. Digital
signals are modeled as discrete functions of time, and
signal processors as second order functions that operate
on them. Moreover Faust’s block-diagram composition
operators, used to combine signal processors together,
fit in the same picture as third order functions.

The Faust compiler translates Faust programs into equivalent
C++ programs. It uses several optimization techniques
in order to generate the most efficient code. The
resulting code can usually compete with, and sometimes
outperform, DSP code directly written in C/C++. It is
also self-contained and doesn’t depend on any DSP runtime
library.

Thanks to specific architecture files, a single Faust program
can be used to produce code for a variety of platforms
and plug-in formats. These architecture files act
as wrappers and describe the interactions with the host
audio and GUI system. Currently more than 10 architectures
are supported (see Table 1) and new ones can be
easily added.


Architecture file     Target

alsa-gtk.cpp          ALSA application + GTK
alsa-qt.cpp           ALSA application + QT4
jack-gtk.cpp          JACK application + GTK
jack-qt.cpp           JACK application + QT4
ca-qt.cpp             CoreAudio application + QT4
ladspa.cpp            LADSPA plug-in
max-msp.cpp           Max MSP plug-in
supercollider.cpp     Supercollider plug-in
vst.cpp               VST plug-in
q.cpp                 Q language plug-in

Table 1: Some architecture files available for Faust


In the following subsections we give a short and informal 
introduction to the language through the example of a 
simple noise generator. Interested readers can refer to 
[1] for a more complete description. 



2.1 A simple noise generator

A Faust program describes a signal processor by
combining primitive operations on signals (like
/, √, sin, cos, ...) using an algebra of high level
composition operators [2] (see Table 2). You can think
of these composition operators as a generalization of
mathematical function composition f ∘ g.

f ~ g     recursive composition
f , g     parallel composition
f : g     sequential composition
f <: g    split composition
f :> g    merge composition

Table 2: The five high level block-diagram composition
operators used in Faust

A Faust program is organized as a set of definitions with
at least one for the keyword process (the equivalent of
main in C).

Our noise generator example noise.dsp only involves
three very simple definitions. But it also shows some
specific aspects of the language:

    random  = +(12345) ~ *(1103515245);
    noise   = random/2147483647.0;
    process = noise * vslider("noise", 0, 0, 100, 0.1)/100;

The first definition describes a (pseudo) random number
generator. Each new random number is computed by
multiplying the previous one by 1103515245 and adding
12345 to the result.

The expression +(12345) denotes the operation of
adding 12345 to a signal. It is an example of a common
technique in functional programming called partial
application: the binary operation + is here provided
with only one of its arguments. In the same way
*(1103515245) denotes the multiplication of a signal by
1103515245.

The two resulting operations are recursively composed
using the ~ operator. This operator connects in a feedback
loop the output of +(12345) to the input of
*(1103515245) (with an implicit 1-sample delay)
and the output of *(1103515245) to the input of
+(12345).

The second definition transforms the random signal into
a noise signal by scaling it between -1.0 and +1.0.

Finally, the definition of process adds a simple user interface
to control the production of the sound. The noise
signal is multiplied by the value delivered by a slider to
control its volume.

2.2 Invoking the compiler

The role of the compiler is to translate Faust programs
into equivalent C++ programs. The key idea to generate
efficient code is not to compile the block diagram itself,
but what it computes.

Driven by the semantic rules of the language, the compiler
starts by propagating symbolic signals into the
block diagram, in order to discover how each output signal
can be expressed as a function of the input signals.

These resulting signal expressions are then simplified
and normalized, and common subexpressions are factorized.
Finally these expressions are translated into a
self-contained C++ class that implements all the required
computation.

To compile our noise generator example we use the following
command:

    $ faust noise.dsp

This command generates the following C++ code on the
standard output:


class mydsp : public dsp {
  private:
    int   iRec0[2];
    float fslider0;
  public:
    static void metadata(Meta* m) {
    }
    virtual int getNumInputs()  { return 0; }
    virtual int getNumOutputs() { return 1; }
    static void classInit(int samplingFreq) {
    }
    virtual void instanceInit(int samplingFreq)
    {
        fSamplingFreq = samplingFreq;
        for (int i=0; i<2; i++) iRec0[i] = 0;
        fslider0 = 0.0f;
    }
    virtual void init(int samplingFreq)
    {
        classInit(samplingFreq);
        instanceInit(samplingFreq);
    }
    virtual void buildUserInterface(UI* interface)
    {
        interface->openVerticalBox("noise");
        interface->declare(&fslider0, "style", "knob");
        interface->addVerticalSlider("noise",
            &fslider0, 0.0f, 0.0f, 100.0f, 0.1f);
        interface->closeBox();
    }
    virtual void compute(int count,
                         float** input,
                         float** output)
    {
        float  fSlow0  = (4.656613e-12f * fslider0);
        float* output0 = output[0];
        for (int i=0; i<count; i++) {
            iRec0[0]   = 12345+1103515245*iRec0[1];
            output0[i] = fSlow0*iRec0[0];
            // post processing
            iRec0[1] = iRec0[0];
        }
    }
};

The generated class contains seven methods.

Among these methods getNumInputs() and
getNumOutputs() return the number of input
and output signals required by our signal processor.
init() initializes the internal state of the signal processor.
buildUserInterface() can be seen as a
list of high level commands, independent of any toolkit,
to build the user interface. The method compute()
does the actual signal processing. It takes 3 arguments:
the number of frames to compute, the addresses of the
input buffers and the addresses of the output buffers,
and computes the output samples according to the input
samples.

2.3 Generating a full application 

The faust command accepts several options to control
the generated code. Two of them are widely used. The
option -o outputfile specifies the output file to be used
instead of the standard output. The option -a architecturefile
defines the architecture file used to wrap the generated
C++ class.

For example the command

    faust -a jack-qt.cpp -o noise.cpp noise.dsp

generates a full JACK application using QT4.4 as its
graphic toolkit. Figure 1 is a screenshot of our noise
application running.



Figure 1: Screenshot of the noise example generated
with the jack-qt.cpp architecture

2.4 Generating a block-diagram 

Another interesting option is -svg, which generates one or
more SVG graphic files that represent the block-diagram
of the program, as in Figure 2.

It is interesting to note the difference between the block
diagram and the generated C++ code. The block diagram
involves one addition, two multiplications and two
divisions. The generated C++ program only involves one
addition and two multiplications per sample. The compiler
was able to optimize the code by factorizing and
reorganizing the operations.

As already said, the key idea here is not to compile the
block diagram itself, but what it computes.
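The folding of the two divisions into one constant can be checked by hand: 1/(2147483647 × 100) ≈ 4.656613e-12, which is the factor applied to the slider value in the generated code. The following C++ fragment is only a minimal sketch of that arithmetic, not compiler output; the function names (nextRandom, noiseNaive, noiseFolded, maxGap) are hypothetical, and unsigned arithmetic is used for the generator because signed overflow is undefined in C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// One step of random = +(12345) ~ *(1103515245).
// Unsigned arithmetic makes the wrap-around well defined; the final
// conversion back to int32_t is modular on all common platforms.
inline int32_t nextRandom(int32_t previous) {
    uint32_t next = 12345u + 1103515245u * static_cast<uint32_t>(previous);
    return static_cast<int32_t>(next);
}

// Literal reading of the block diagram: two divisions per sample.
inline float noiseNaive(int32_t r, float slider) {
    return (r / 2147483647.0f) * (slider / 100.0f);
}

// Folded form: one constant computed per block, one multiplication per sample.
inline float noiseFolded(int32_t r, float slider) {
    const float gain = 4.656613e-12f * slider;
    return gain * r;
}

// Largest absolute difference between both forms over `steps` samples.
float maxGap(int steps, float slider) {
    assert(steps > 0);
    int32_t r = 0;
    float gap = 0.0f;
    for (int i = 0; i < steps; ++i) {
        r = nextRandom(r);
        gap = std::max(gap, std::fabs(noiseNaive(r, slider) - noiseFolded(r, slider)));
    }
    return gap;
}
```

Both forms agree up to single precision rounding, which is why the compiler is free to keep only the cheaper one.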

Figure 2: Graphic block-diagram of the noise generator
produced with the -svg option

3 Code generation

In this section we describe how the Faust compiler generates
its code. We will first introduce the so-called scalar
generation of code, which was the only one until version
0.9.9.5. Then we will present the vector generation of
code, where the code is organized into several loops that
operate on vectors, and finally the parallel generation
of code, where these vector loops are parallelized using
OpenMP directives.

3.1 Preliminary steps 

Before reaching the stage of the C++ code generation,
the Faust compiler has to carry out several steps that we
describe briefly here.

3.1.1 Parsing source files 

The first one is to recursively parse all the source
files involved. Each source file contains a set of definitions
and possibly some import directives for other
source files. The result of this phase is a list of
definitions: [(name_1 = definition_1), (name_2 =
definition_2), ...]. This list is actually a set, as redefinitions
of symbols are not allowed.

3.1.2 Evaluating block-diagrams 

Among the names defined there must be process, the analog
of main in C/C++. This definition has to be evaluated
as Faust allows algorithmic block-diagram definitions.

For example the algorithmic definition:

Listing 1: example of algorithmic definition

    foo(n) = *(10+n);
    process = par(i,3, foo(i));

will be translated into a flat block-diagram description that
contains only primitive blocks:

    process = (_,10:*),(_,11:*),(_,12:*);


This description is said to be in normal form. 
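To make the evaluation step concrete, the expansion of par(i,3,foo(i)) can be mimicked with plain string building. This is only an illustrative sketch: the actual compiler evaluates block-diagram terms rather than strings, and the helper names expandFoo and expandPar are hypothetical.

```cpp
#include <cassert>
#include <string>

// Flat form of foo(n) = *(10+n), written as "(_,<10+n>:*)".
std::string expandFoo(int n) {
    return "(_," + std::to_string(10 + n) + ":*)";
}

// Flat form of par(i, count, foo(i)): the instances of foo are
// placed side by side with the parallel composition operator ','.
std::string expandPar(int count) {
    assert(count > 0);
    std::string flat;
    for (int i = 0; i < count; ++i) {
        if (i > 0) flat += ",";
        flat += expandFoo(i);
    }
    return flat;
}
```

Calling expandPar(3) yields the right-hand side of the flat definition of process given above.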
































3.1.3 Discovering the mathematical equations

Faust doesn’t compile a block-diagram directly. It uses a
phase of symbolic propagation to first discover its mathematical
semantics (what it computes). The principle is
to propagate symbolic signals through the inputs of the
block-diagram in order to get, at the other end, the mathematical
equation of each output signal.

These equations are then normalized so that different
block-diagrams that compute mathematically equivalent
signals result in the same output equations.

Consider a very simple example where the input signal is
divided by 2 and then delayed by 10 samples.

This is equivalent to having the input signal first multiplied
by 2, then delayed by 7 samples, then divided by 4
and then delayed by 3 samples.

Both lead to the following signal equation:

    Y(t) = 0.5 * X(t - 10)

Faust applies several rules to simplify and normalize output
signal equations. For example one of these rules
says that it is better to multiply a signal by a constant after
a delay than before. It gives the compiler more opportunities
to share and reuse the same delay line. Another
rule says that two consecutive delays can be combined
into a single one.

3.1.4 Typing the mathematical equations

The next phase is to assign types to the resulting signal
equations. This will not only help the compiler to detect
errors but also to generate the most efficient code.
Several aspects are considered:

1. the nature of the signal: integer or float.

2. the interval of values of the signal: the minimum and
maximum values that a signal can take.

3. the computation time of the signal: the signal can be
computed at compilation time, at initialization time
or at execution time.

4. the speed of the signal: constant signals are computed
only once, low speed user interface signals
are computed once for every block of samples, high
speed audio signals are computed every sample.

5. the parallelism of the signal: true if the samples of
the signal can be computed in parallel, false when
the signal has recursive dependencies requiring its
samples to be computed sequentially.

3.1.5 Occurrence analysis

The role of this last preparation phase is to analyze
in which context each subexpression is
used and to discover common subexpressions.
If an expensive common subexpression is discovered,
an assignment to a cache variable

    float fTemp = <common subexpression code>;

is generated, and the cache variable fTemp is used in
its enclosing expressions. Otherwise the subexpression
code is inlined.

The occurrence analysis proceeds by a top-down visit of
the signal expression. The first time a subexpression is
visited, it is annotated with a counter. The next time, the
counter is increased and its visit is skipped.

Subexpressions with several occurrences are candidates
to be cached in variables. However in some circumstances
expressions with a single occurrence also need to
be cached if they occur in a faster context. For example a
constant expression occurring in a low speed user interface
expression, or a user interface expression occurring
in a high speed audio expression, will generally need
to be cached.

It is only after this phase that the generation of the C++
code can start.
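The counting part of the occurrence analysis (section 3.1.5) can be sketched as a top-down visit over a shared expression graph. The C++ fragment below is a simplified illustration under stated assumptions: the Expr type and the functions countOccurrences and isCacheCandidate are hypothetical, and the speed-context rule (single occurrences used in a faster context) is deliberately left out.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// A signal expression; shared subexpressions are shared nodes.
struct Expr {
    std::string label;
    std::vector<std::shared_ptr<Expr>> children;
};
using ExprPtr = std::shared_ptr<Expr>;
using Occurrences = std::map<const Expr*, int>;

// Top-down visit: the first visit creates the counter and descends,
// later visits only increase the counter and skip the sub-tree.
void countOccurrences(const ExprPtr& e, Occurrences& occ) {
    assert(e != nullptr);
    int& counter = occ[e.get()];   // starts at 0 on first access
    ++counter;
    if (counter > 1) return;
    for (const ExprPtr& child : e->children) countOccurrences(child, occ);
}

// Subexpressions seen more than once are candidates for a cache variable.
bool isCacheCandidate(const ExprPtr& e, const Occurrences& occ) {
    auto it = occ.find(e.get());
    return it != occ.end() && it->second > 1;
}
```

For an expression such as (x + y) * (x + y), where both operands are the same node, the sum is counted twice and becomes a candidate, while x and y are visited only once because the second visit of the sum is skipped.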


3.2 Scalar Code generation 

The generation of the C++ code is made by populating a
klass object (representing a C++ class) with strings representing
C++ declarations and lines of code. In scalar
mode these lines of code are organized in a single sample
computation loop, while they can be split into several
loops with the new vector and parallel schemes.

The code generation relies basically on two functions: a
translation function [[ ]] that translates a signal expression
into a string of C++ code, and a cache function C() that
checks if a variable is needed.

We don’t have the space to go into too much detail, but
here is the translation rule for the addition of two signal
expressions:

$$\frac{[\![E_1]\!] \rightarrow S_1 \qquad [\![E_2]\!] \rightarrow S_2}{[\![E_1 + E_2]\!] \rightarrow C(\texttt{"}(S_1 + S_2)\texttt{"})}$$

It says that to compile the addition of two signals we
compile each of these signals and concatenate the resulting
strings with a + sign in between. The string obtained is
passed to the cache function, which checks whether the expression
is shared or not.

Let us say that the string passed to the cache function
C() is (input0[i] + input1[i]). If the expression
is shared, the cache function will allocate
a fresh variable name fTemp0, add the line of
code float fTemp0 = (input0[i] + input1[i]); to
the klass object and return fTemp0 as a string to be
used when compiling enclosing expressions. If the expression
is not shared it will simply return the string
(input0[i] + input1[i]) unmodified.
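The interplay between the translation rule and the cache function can be sketched in a few lines of C++. This is a hedged illustration, not the compiler’s source: the Klass structure, its fields and the functions cache and compileAdd are hypothetical stand-ins, and the set of shared expressions is assumed to have been filled by the occurrence analysis.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Minimal stand-in for the klass object: it only collects loop lines.
struct Klass {
    std::vector<std::string> lines;              // lines of the sample loop
    std::set<std::string> shared;                // expressions known to be shared
    std::map<std::string, std::string> cached;   // expression -> variable name
};

// C(): returns the code unchanged, or the name of a cache variable
// (declared once in the klass) when the expression is shared.
std::string cache(Klass& k, const std::string& code) {
    assert(!code.empty());
    if (k.shared.count(code) == 0) return code;
    auto found = k.cached.find(code);
    if (found != k.cached.end()) return found->second;
    std::string name = "fTemp" + std::to_string(k.cached.size());
    k.lines.push_back("float " + name + " = " + code + ";");
    k.cached[code] = name;
    return name;
}

// [[E1 + E2]] -> C("(S1 + S2)"), where s1 and s2 are already compiled.
std::string compileAdd(Klass& k, const std::string& s1, const std::string& s2) {
    return cache(k, "(" + s1 + " + " + s2 + ")");
}
```

With an empty shared set, compileAdd returns the parenthesized sum itself; once the sum is marked as shared, it returns fTemp0 and the declaration is appended to the klass lines exactly once.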

To illustrate this, let’s take two simple examples. The first
one converts a stereo signal into a mono signal by adding
the two input signals:

    process = +;

In this case (input0[i] + input1[i]) is not shared and
the generated C++ code is the following:


virtual void compute (int count,
                      float** input,
                      float** output)
{
    float* input0  = input[0];
    float* input1  = input[1];
    float* output0 = output[0];
    for (int i=0; i<count; i++)
    {
        output0[i] = (input0[i] + input1[i]);
    }
}


But when the sum of the two input signals is duplicated
on two output signals as in:

    process = + <: _,_;

then (input0[i] + input1[i]) will be cached in a temporary
variable:


virtual void compute (int count,
                      float** input,
                      float** output)
{
    float* input0  = input[0];
    float* input1  = input[1];
    float* output0 = output[0];
    float* output1 = output[1];
    for (int i=0; i<count; i++)
    {
        float fTemp0 = (input0[i] + input1[i]);
        output0[i] = fTemp0;
        output1[i] = fTemp0;
    }
}


3.3 Vector Code generation 

Modern C++ compilers are able to do autovectorization,
that is to use SIMD instructions to speed up the code.
These instructions can typically operate in parallel on
short vectors of 4 single precision floating point numbers,
thus leading to a theoretical speedup of 4. Autovectorization
of C/C++ programs is a difficult task. Current
compilers are very sensitive to the way the code is
arranged. In particular, loops that are too complex can prevent
autovectorization. The goal of the new vector code generation
is to rearrange the C++ code in a way that facilitates
the autovectorization job of the C++ compiler.
Instead of generating a single sample computation loop,
it splits the computation into several simpler loops that
communicate by vectors.

The vector code generation is activated by passing the
--vectorize (or -vec) option to the Faust compiler. Two
additional options are available: --vec-size <n> controls
the size of the vector (by default 32 samples) and
--loop-variant 0/1 gives some additional control on
the loops.

To illustrate the difference between scalar code and vector
code, let’s take the computation of the RMS (Root
Mean Square) value of a signal. Here is the Faust code
that computes the Root Mean Square of a sliding window
of 1000 samples:

// Root Mean Square of n consecutive samples 
RMS(n) = square : mean(n) : sqrt ; 

// Square of a signal 
square(x) = x * x ; 

// Mean of n consecutive samples of a signal 
// (uses fixpoint to avoid the accumulation of 
// rounding errors) 

mean(n) = float2fix : integrate(n) :
          fix2float : /(n) ;

// Sliding sum of n consecutive samples
integrate(n,x) = x - x@n : +~_ ;

// Conversion between float and fix point
float2fix(x) = int(x*(1<<20));
fix2float(x) = float(x)/(1<<20);

// Root Mean Square of 1000 consecutive samples 
process = RMS(1000) ; 


The compute() method generated in scalar mode is the 
following : 

virtual void compute (int count,
                      float** input,
                      float** output)
{
    float* input0  = input[0];
    float* output0 = output[0];
    for (int i=0; i<count; i++) {
        float fTemp0 = input0[i];
        int   iTemp1 = int(1048576*fTemp0*fTemp0);
        iVec0[IOTA&1023] = iTemp1;
        iRec0[0] = ((iVec0[IOTA&1023] + iRec0[1])
                    - iVec0[(IOTA-1000)&1023]);
        output0[i] = sqrtf(9.536744e-10f *
                           float(iRec0[0]));
        // post processing
        iRec0[1] = iRec0[0];
        IOTA = IOTA+1;
    }
}

The -vec option leads to the following reorganization of 
the code : 

virtual void compute (int fullcount,
                      float** input,
                      float** output)
{
    int  iRec0_tmp[32+4];
    int* iRec0 = &iRec0_tmp[4];
    for (int index=0; index<fullcount; index+=32)
    {
        int count = min (32, fullcount-index);
        float* input0  = &input[0][index];
        float* output0 = &output[0][index];
        for (int i=0; i<4; i++)
            iRec0_tmp[i]=iRec0_perm[i];
        // SECTION : 1
        for (int i=0; i<count; i++) {
            iVec0[(iVec0_idx+i)&2047] =
                int(1048576*input0[i]*input0[i]);
        }
        // SECTION : 2
        for (int i=0; i<count; i++) {
            iRec0[i] = ((iVec0[i] + iRec0[i-1]) -
                iVec0[(iVec0_idx+i-1000)&2047]);
        }
        // SECTION : 3
        for (int i=0; i<count; i++) {
            output0[i] = sqrtf((9.536744e-10f *
                                float(iRec0[i])));
        }
        // SECTION : 4
        iVec0_idx = (iVec0_idx+count)&2047;
        for (int i=0; i<4; i++)
            iRec0_perm[i]=iRec0_tmp[count+i];
    }
}


While the second version of the code is more complex,
it turns out to be much easier for the C++ compiler to
vectorize efficiently. Using Intel icc 11.0, with the exact
same compilation options: -O3 -xHost -ftz
-fno-alias -fp-model fast=2, the scalar version
leads to a throughput performance of 129.144 MB/s,
while the vector version achieves 359.548 MB/s, a
speedup of about 2.8.
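Both versions of the RMS code rely on the same two ingredients: a delay line stored in a power-of-two circular buffer indexed with a bit mask, and a sliding sum kept in fixed point so that rounding errors don’t accumulate. The following self-contained C++ sketch mirrors the scalar loop under simplifying assumptions: the struct name SlidingRms and its members are hypothetical, input samples are assumed to stay within [-1, 1], and an unsigned counter is used so that the masked index arithmetic is well defined.

```cpp
#include <cassert>
#include <cmath>

// Sliding RMS over 1000 samples, in the style of the scalar loop.
struct SlidingRms {
    static constexpr unsigned kSize = 1024;    // power of two larger than the window
    static constexpr unsigned kWindow = 1000;  // samples in the sliding window
    int buffer[kSize] = {};                    // squared samples, fixed point (<<20)
    int sum = 0;                               // exact sliding sum
    unsigned iota = 0;                         // running sample counter

    float process(float x) {
        assert(std::fabs(x) <= 1.0f);          // keeps the integer sum in range
        const int squared = int(1048576 * x * x);
        buffer[iota & (kSize - 1)] = squared;
        // add the newest square, drop the one that left the window
        sum += squared - buffer[(iota - kWindow) & (kSize - 1)];
        ++iota;
        // 9.536744e-10 is about 1 / (1000 * 2^20): back to float, then mean
        return std::sqrt(9.536744e-10f * float(sum));
    }
};
```

Because the sum is an integer, a sample that is added on entering the window is subtracted exactly when it leaves, which is the point of the fixed point conversion in the Faust source.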


[Figure: three stacked layers, from top to bottom: parallel code
generator (OpenMP directives), vector code generator (loop
separation), scalar code generator]

Figure 3: Faust’s stack of code generators

The vector code generation is built on top of the scalar 
code generation (see figure 3). Every time an expression 
needs to be compiled, the compiler checks to see if it 
needs to be in a separate loop or not. It applies some 
simple rules for that. Expressions that are shared (and 
are complex enough) are good candidates to be compiled 
in a separate loop, as well as recursive expressions and 
expressions used in delay lines. 


The result is a directed graph in which each node is a 
computation loop. This graph is stored in the klass object 
and a topological sort is applied to it before printing the 
code. 

3.4 Parallel Code generation 

The parallel code generation is activated by passing the
--openMP (or -omp) option to the Faust compiler. It implies
the -vec option, as the parallel code generation is
built on top of the vector code generation by inserting
appropriate OpenMP directives in the C++ code.

3.4.1 The OpenMP API 



Figure 4: OpenMP is based on a fork-join model 

OpenMP (http://www.openmp.org) is a well established
API that is used to explicitly define direct multithreaded,
shared memory parallelism. It is based on a
fork-join model of parallelism (see figure 4). Parallel regions
are delimited by using the #pragma omp parallel
construct. At the entrance of a parallel region a team of
parallel threads is activated. The code within a parallel
region is executed by each thread of the parallel team until
the end of the region.

#pragma omp parallel 
{ 

// the code here is executed simultaneously by 
// every thread of the parallel team 

} 


In order not to have every thread redundantly doing
the exact same work, OpenMP provides specific work-sharing
directives. For example #pragma omp sections
allows the work to be broken into separate, discrete sections,
each section being executed by one thread:

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            // job 1
        }
        #pragma omp section
        {
            // job 2
        }
    }
}


3.4.2 Adding OpenMP directives 

As already said, the parallel code generation is built on
top of the vector code generation. The graph of loops
produced by the vector code generator is topologically
sorted in order to detect the loops that can be computed
in parallel. The first set L_0 contains the loops that don’t
depend on any other loops, the set L_1 contains the loops
that only depend on loops of L_0, etc.

As all the loops of a given set L_n can be computed in
parallel, the compiler will generate a sections construct
with a section for each loop.

#pragma omp sections
{
    #pragma omp section
    for (...) {
        // Loop 1
    }
    #pragma omp section
    for (...) {
        // Loop 2
    }
}


If a given set contains only one loop, then the compiler
checks to see if the loop can be parallelized (no recursive
dependencies) or not. If it can be parallelized, it generates:

#pragma omp for 
for (...) { 

// Loop code 

} 


otherwise it generates a single construct so that only one 
thread will execute the loop : 

#pragma omp single 

for (...) { 

// Loop code 

} 
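The grouping of loops into the sets L_0, L_1, ... and the choice of directive for each set can be sketched as follows. This is a simplified C++ illustration, not the compiler’s implementation: the Loop structure and the functions levelSets and directiveFor are hypothetical, and the dependency graph is assumed to be acyclic.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A computation loop: the loops it depends on, and whether its
// iterations carry a recursive dependency.
struct Loop {
    std::vector<int> deps;
    bool recursive;
};

// Groups loops into L0, L1, ...: a loop enters the first set in which
// all the loops it depends on have already been placed.
std::vector<std::vector<int>> levelSets(const std::vector<Loop>& loops) {
    const int n = static_cast<int>(loops.size());
    std::vector<int> level(n, -1);               // -1 means not placed yet
    std::vector<std::vector<int>> sets;
    int placed = 0;
    while (placed < n) {
        std::vector<int> current;
        for (int i = 0; i < n; ++i) {
            if (level[i] != -1) continue;
            bool ready = true;
            for (int d : loops[i].deps)
                if (level[d] == -1) ready = false;
            if (ready) current.push_back(i);
        }
        assert(!current.empty());                // fails only on a cyclic graph
        for (int i : current) level[i] = static_cast<int>(sets.size());
        placed += static_cast<int>(current.size());
        sets.push_back(current);
    }
    return sets;
}

// Directive emitted for one set, following the rules described above.
std::string directiveFor(const std::vector<Loop>& loops, const std::vector<int>& set) {
    if (set.size() > 1) return "#pragma omp sections";
    return loops[set[0]].recursive ? "#pragma omp single" : "#pragma omp for";
}
```

For two independent recursive loops feeding a third, non-recursive one, this yields L_0 with two loops (a sections construct) and L_1 with one loop (a for construct).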


3.4.3 Example of parallel code 

To illustrate how Faust uses the OpenMP directives,
here is a very simple example: two 1-pole filters in parallel
connected to an adder (see figure 5 for the corresponding
block-diagram):

    filter(c) = *(1-c) : + ~ *(c);

    process = filter(0.9), filter(0.9) : +;


Figure 5: two filters in parallel connected to an adder

The corresponding compute() method obtained using the 
-omp option is the following : 


virtual void compute (int fullcount,
                      float** input,
                      float** output)
{
    float  fRec0_tmp[32+4];
    float  fRec1_tmp[32+4];
    float* fRec0 = &fRec0_tmp[4];
    float* fRec1 = &fRec1_tmp[4];
    #pragma omp parallel firstprivate(fRec0,fRec1)
    {
        for (int index = 0; index < fullcount;
             index += 32)
        {
            int count = min (32, fullcount-index);
            float* input0  = &input[0][index];
            float* input1  = &input[1][index];
            float* output0 = &output[0][index];
            #pragma omp single
            {
                for (int i=0; i<4; i++)
                    fRec0_tmp[i]=fRec0_perm[i];
                for (int i=0; i<4; i++)
                    fRec1_tmp[i]=fRec1_perm[i];
            }
            // SECTION : 1
            #pragma omp sections
            {
                #pragma omp section
                for (int i=0; i<count; i++) {
                    fRec0[i] = ((0.1f * input1[i])
                                + (0.9f * fRec0[i-1]));
                }
                #pragma omp section
                for (int i=0; i<count; i++) {
                    fRec1[i] = ((0.1f * input0[i])
                                + (0.9f * fRec1[i-1]));
                }
            }
            // SECTION : 2
            #pragma omp for
            for (int i=0; i<count; i++) {
                output0[i] = (fRec1[i] + fRec0[i]);
            }
            // SECTION : 3
            #pragma omp single
            {
                for (int i=0; i<4; i++)
                    fRec0_perm[i]=fRec0_tmp[count+i];
                for (int i=0; i<4; i++)
                    fRec1_perm[i]=fRec1_tmp[count+i];
            }
        }
    }
}

This code calls for some comments:

1. The parallel construct #pragma omp parallel is
the fundamental construct that starts parallel execution.
The number of parallel threads is generally
the number of CPU cores but it can be controlled in
several ways.

2. Variables external to the parallel region are shared
by default. The firstprivate(fRec0,fRec1)
clause indicates that each thread should have its private
copy of fRec0 and fRec1. The reason is that accessing
shared variables requires an indirection and
is very inefficient compared to private copies.

3. The top level loop for (int index = 0; ...) ...
is executed by all threads simultaneously. The subsequent
work-sharing directives inside the loop
indicate how the work must be shared between the
threads.

4. Please note that an implied barrier exists at the end
of each work-sharing region. All threads must have
executed the barrier before any of them can continue.

5. The work-sharing directive #pragma omp single
indicates that this first section will be executed by
only one thread (any of them).

6. The work-sharing directive #pragma omp sections
indicates that each corresponding
#pragma omp section, here our two filters,
will be executed in parallel.

7. The loop construct #pragma omp for specifies that
the iterations of the associated loop will be executed
in parallel. The iterations of the loop are distributed
across the parallel threads. For example if we have
two threads the first one can compute indices between
0 and count/2 and the other between count/2
and count.

8. Finally #pragma omp single in section 3 indicates
that this last section will be executed by only one
thread (any of them).
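The directives discussed above can be tried on a reduced, self-contained version of the example that processes a single block. This is only a sketch under stated assumptions: the function name filtersAndSum and its state parameters are hypothetical, the state handling is simplified compared with the _perm/_tmp arrays of the generated code, and the file is assumed to be compiled with OpenMP enabled (for instance -fopenmp with GCC). Without that option the pragmas are ignored and the code runs sequentially with the same result.

```cpp
#include <cassert>
#include <vector>

// Two one-pole filters (coefficient 0.9) followed by an adder, for one
// block of `count` samples. state0/state1 carry the last output of each
// filter from one call to the next.
void filtersAndSum(const float* in0, const float* in1, float* out, int count,
                   float& state0, float& state1) {
    assert(count > 0);
    std::vector<float> rec0(count + 1), rec1(count + 1);
    rec0[0] = state0;                 // previous sample sits at index -1 of r0
    rec1[0] = state1;
    float* r0 = rec0.data() + 1;
    float* r1 = rec1.data() + 1;

    #pragma omp parallel firstprivate(r0, r1)
    {
        // The two recursive loops are independent: one section each.
        #pragma omp sections
        {
            #pragma omp section
            for (int i = 0; i < count; i++)
                r0[i] = 0.1f * in1[i] + 0.9f * r0[i - 1];
            #pragma omp section
            for (int i = 0; i < count; i++)
                r1[i] = 0.1f * in0[i] + 0.9f * r1[i - 1];
        }   // implied barrier: both filters are done here

        // No recursion in the adder: iterations are split among threads.
        #pragma omp for
        for (int i = 0; i < count; i++)
            out[i] = r1[i] + r0[i];
    }
    state0 = r0[count - 1];
    state1 = r1[count - 1];
}
```

Each filter loop stays sequential inside its section because sample i depends on sample i-1, whereas the adder loop has no such dependency and can be divided freely.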


4 Benchmarks 

To compare the performance of these three types of code
generation in a realistic situation we have implemented a
special alsa-gtk-bench.cpp architecture file that measures
the duration of the compute() method. Here is a fragment
of this architecture file:

while(running) { 
audio.read(); 

STARTMESURE 

DSP.compute(audio.buffering() , 

audio.inputSoftChannels() , 
audio.outputSoftChannels() 

) ; 

STOPMESURE 
audio.write() ; 

running = mesure <= (KMESURE + KSKIP); 

} 


The methodology is the following. The duration of the
compute method is measured by reading the TSC (Time
Stamp Counter) register. A total of 128+2048 measures
are made per run. The first 128 measures are considered
a warm-up period and are skipped. The median value of
the following 2048 measures is computed. This median
value, expressed in processor cycles, is first converted into
a duration, and then into a number of bytes produced per second,
considering the audio buffer size (2048 in our test)
and the number of output channels.
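The conversion from cycle counts to throughput can be written down explicitly. The C++ sketch below is an illustration under stated assumptions, not the code of the benchmark architecture file: the function names are hypothetical, samples are assumed to be 4-byte floats, and the processor frequency is assumed to be known and constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Median of the measured cycle counts (upper median for even sizes).
uint64_t medianCycles(std::vector<uint64_t> cycles) {
    assert(!cycles.empty());
    auto mid = cycles.begin() + cycles.size() / 2;
    std::nth_element(cycles.begin(), mid, cycles.end());
    return *mid;
}

// Bytes of output produced per second for one compute() call that
// lasted `cycles` processor cycles on a CPU running at `cpuHz`.
double throughputBytesPerSecond(uint64_t cycles, double cpuHz,
                                int frames, int outputs) {
    assert(cycles > 0 && cpuHz > 0.0);
    const double seconds = cycles / cpuHz;
    const double bytes = double(frames) * outputs * sizeof(float);
    return bytes / seconds;
}
```

For example, a median of 2167000 cycles at 2167 MHz corresponds to 1 ms; with a buffer of 2048 frames and one output channel this gives 8192 bytes per millisecond, that is about 8.19 MB/s.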

This throughput performance is a good indicator. The
memory bandwidth is a strong limiting factor for today’s
processors, and it has to be shared among the processors.
In other words, on an SMP machine a realtime audio program
can never go faster than the memory bandwidth.
And if a sequential program already uses all the available
memory bandwidth, there is no room for improvement.
In this case a parallel version can only perform
worse.

4.1 Machines and compilers used 

In order to compare the scalar code generation with the
new vector and parallel code generation we have compiled
a series of tests with Faust 0.9.9.5b2 in three different
versions. The following commands were used:

- scal : faust -a alsa-gtk-bench.cpp
         test.dsp -o test.cpp

- vec  : faust -a alsa-gtk-bench.cpp
         -vec -vs 3968 test.dsp -o test.cpp

- par  : faust -a alsa-gtk-bench.cpp
         -omp -vs 3968 test.dsp -o test.cpp

We have also used two different C++ compilers, GNU
GCC and Intel ICC:

- GCC version 4.3.2 with options: -O3
  -march=native -mfpmath=sse
  -msse -msse2 -msse3 -ffast-math
  -ftree-vectorize (-fopenmp is added for
  OpenMP).

- ICC version 11.0.074 with options: -O3 -xHost
  -ftz -fno-alias -fp-model fast=2
  (-openmp is added for OpenMP).

All the tests were run on three different machines : 

- vaio : a Sony Vaio SZ3VP laptop, with an Intel 
T7400 dual core processor at 2167 MHz, 2GB of 
Ram, running an Ubuntu 7.10 distribution with a 
2.6.22-15-generic kernel. 

- xps : a Dell XPS machine with an Intel Q9300 quad 
core processor at 2500 MHz, 4GB of Ram, run¬ 
ning an Ubuntu 8.10 distribution with a 2.6.22-15- 
generic kernel. 

- macpro : an Apple Macpro with two Intel Xeon 
X5365 quad core processors at 3000 MHz, 2GB of 
Ram, running an Ubuntu 8.10 distribution with a 
2.6.27-12-generic kernel 

4.2 Benchmark: copy1.dsp

The goal of this first test is to measure the memory bandwidth.
We use a very simple Faust program, copy1.dsp,
that simply copies the input signal to the output signal.

The results we have obtained are summarized in figure 6.
The horizontal axis corresponds to the three Faust compilation
schemes (scalar, vector and parallel) combined
with the two C++ compilers (gcc and icc). The vertical
axis is the throughput: how many bytes of samples each
tested program is able to produce per second (higher values
are better).

It is interesting to note how catastrophic the performance
of the parallel versions is. The scalar and vector
versions are quite similar, with a little advantage to the
scalar version. The code generated by icc performs better.
The memory bandwidth of the Macpro is disappointing,
especially considering that it has to be shared by 8
cores.

How stable are these measures? Figure 7 compares
the performance of copy1 (compiled with icc) on the
Macpro over 5 different runs. As we can see the stability
is reasonably good.

4.3 Benchmark: freeverb.dsp 

The second test is freeverb.dsp, a Faust implementation 
of the Freeverb (the source can be found in the Faust dis¬ 
tribution). 

The results are given in figure 8. Here gcc gives very good
results in scalar code and outperforms icc in 2 of the 3
cases. But the performance of gcc is still very poor on
vector and parallel code.

Figure 6: copy1.dsp benchmark

Figure 7: Stability of measurements (copy1 on macpro, icc
version)

Despite the fact that freeverb has a limited amount of
parallelism, icc gives quite convincing results, with a
reasonable speedup on vector and parallel code on the Vaio
and the XPS machines. It is also interesting to note that,
in the parallel version, the eight 3 GHz cores of the
Macpro were slower than the four 2.5 GHz cores of the XPS!

Figure 8: Freeverb.dsp benchmark

4.4 Benchmark: karplus32.dsp

Karplus32.dsp is a generalized version of the
Karplus-Strong algorithm with 32 slightly detuned strings
in parallel (the source can be found in the Faust
distribution). Figure 9 gives the results. Again, gcc
performs excellently in scalar mode, and icc shows a good
progression in vector mode as well as in parallel mode.

Figure 9: Karplus32.dsp benchmark

4.5 Benchmark: mixer.dsp

This is the implementation of a simple 8-channel mixer.
Each channel has a mute button, a volume control in dB,
a vumeter and a stereo pan control. The mixer also has a
volume control for the stereo output.

import("music.lib");

smooth(c) = *(1-c) : +~*(c);

vol = *( vslider("fader", 0, -60, 4, 0.1)
         : db2linear : smooth(0.99) );

mute = *(1 - checkbox("mute"));

vumeter(x) = attach(x, env(x) : vbargraph("",0,1))
with {
    env = abs:min(0.99):max ~ -(1.0/SR);
};

pan = _ <: *(sqrt(1-c)), *(sqrt(c))
with {
    c = ( nentry("pan",0,-8,8,1)-8)/-16 : smooth(0.99);
};

voice(v) = vgroup("voice %v",
                  mute :
                  hgroup("", vol : vumeter) :
                  pan);

stereo = hgroup("stereo out", vol, vol);

process = hgroup("mixer",
                 par(i,8,voice(i)) :> stereo);


The results of figure 10 show a real benefit for the
vectorized version, with a speedup exceeding x2 on the 3
machines. Parallelization also has a positive impact,
although a more limited one. As usual gcc delivers good
scalar code but poor results on vectorized and OpenMP
code.

Figure 10: mixer.dsp benchmark

4.6 Benchmark: fdelay8.dsp 

This test implements an 8-channel fractional delay.
Each channel has a volume control in dB as well as a
delay control in fractions of samples. The interpolation is
based on a fifth-order Lagrange interpolation from Julius
Smith's Faust filter library.

import("filter.lib");

line(i) = vgroup("line %i", fdelay5(128, d) : *(g))
with {
    g = vslider("gain (dB)", -60, -60, 4, 0.1)
        : db2linear : smooth(0.995);
    d = nentry("delay (samp)", 10, 10, 128, 0.1)
        : smooth(0.995);
};

process = hgroup("", par(i, 8, line(i)));


The results are presented in figure 11. The Macpro exhibits
a good speedup of x2.5 for its parallel version. The
parallel speedup for the XPS machine is more limited and
there is no speedup at all on the Vaio.

Figure 11: fdelay8.dsp benchmark

4.7 Benchmark: rms.dsp 

The Faust source of rms.dsp was presented in section 3.3.
It is a purely sequential algorithm, therefore the
performance of the parallel versions is very bad. But, as
figure 12 indicates, vectorization gives a real boost to
the performance, particularly on the Vaio.

4.8 Benchmark: rms8.dsp 

This test computes the RMS value on 8 channels in
parallel. The Faust code is:


process = par(i,8,component("rms.dsp")) ; 


There is obviously a good amount of parallelism here,
which icc is able to exploit, as indicated by the results
in figure 13. Compared to the scalar performance, the
parallel version exhibits a speedup of nearly x3 on the
Mac, while the speedup for the XPS exceeds x2.5. But the
record, relative to the number of cores, goes to the
dual-core Vaio with a speedup of x2.2!

Figure 12: rms.dsp benchmark

Figure 13: rms8.dsp benchmark
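
To illustrate why eight independent channels parallelize
well, here is a minimal C++ sketch with an OpenMP pragma.
It is not the code generated by Faust, and block_rms is a
plain block RMS rather than the rms.dsp algorithm of
section 3.3; the names block_rms, rms8 and kChannels are
hypothetical. Without OpenMP support enabled the pragma is
ignored and the loop simply runs sequentially.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kChannels = 8;  // assumed channel count, as in rms8.dsp

// Root mean square of one block of samples (0 for an empty block).
double block_rms(const std::vector<float>& x) {
    double sum = 0.0;
    for (float v : x) {
        sum += static_cast<double>(v) * v;
    }
    return x.empty() ? 0.0 : std::sqrt(sum / x.size());
}

// Each channel reads only its own input and writes only its own
// output slot, so the iterations have no dependencies and can be
// distributed over the available cores.
std::array<double, kChannels> rms8(
        const std::array<std::vector<float>, kChannels>& in) {
    std::array<double, kChannels> out{};
    #pragma omp parallel for
    for (int c = 0; c < kChannels; ++c) {
        out[c] = block_rms(in[c]);
    }
    return out;
}
```

A purely sequential computation such as a single rms channel
offers no such independent iterations, which matches the poor
parallel results reported for rms.dsp.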

5 Conclusion 

We have presented two new compilation schemes recently
introduced in the Faust compiler. The vector
scheme simplifies the autovectorization work of the C++
compiler by splitting the sample processing loop into
several simpler loops. The parallel scheme analyzes
the dependencies between these loops and adds OpenMP
pragmas to indicate those that can be computed in
parallel.

Figure 14 shows the speedup obtained with the vectorized
code. With a good autovectorizing C++ compiler like Intel
icc 11.0 we can obtain very significant improvements in
many cases. By contrast, gcc 4.3.2 was not able to
generate SIMD instructions, leading to a degradation of
the performance. We therefore highly recommend icc to
compile vectorized code. That is a pity considering the
excellent results of gcc on scalar code.

Figure 14: Speedup ratio between vector and scalar code
(using icc)

Figure 15: Speedup ratio between parallel and scalar
code (using icc)

According to Amdahl's law, the speedup obtained with the
parallelized code is highly dependent on the quantity of
parallelism available (see figure 15). On purely parallel
programs like fdelay8 and rms8 a speedup exceeding x2.5
was observed on the Mac. This is a little disappointing
for an 8-core machine, but consistent with its relatively
limited memory bandwidth. Here too we recommend icc to
compile OpenMP applications.

Obviously all these results depend on many choices and
settings, in particular the compiler options. The options
we have retained were the best we could find, but the
parameter space is huge and we have only explored a small
part of it. It may be the case that the gcc results could
be improved by changing the settings. This would be good
news, and the authors are interested in any suggestions
on that point.

There are also many possible improvements in the code
generated by Faust. While it is easy to discover all the
potential parallelism of a Faust program¹, generating
efficient OpenMP programs is much more difficult due to
the overheads introduced and the additional pressure on
the shared memory.

The tradeoff between parallelism on one side and overhead
and memory pressure on the other is something that we will
have to improve in future versions. It will also be
interesting to explore the possibilities of GPGPU and
their high-level programming languages as an alternative
to C++ and OpenMP.

¹ Parallel programming is probably the big opportunity for
functional programming languages compared to imperative
languages.

Resources

1. http://openmp.org/

2. http://faust.grame.fr

3. http://www.intel.com/cd/software/products/asmo-na/eng/277618.htm